AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsFeb-11-2026, 22:48:15 GMT

f5bf0ba0a17ef18f9607774722f5698c-Paper.pdf

dataset, language model, proc, (16 more...)

Country:

Oceania > New Zealand (0.04)
North America > United States > California > Monterey County > Pacific Grove (0.04)
North America > Canada > Ontario > Toronto (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Government > Regional Government (0.67)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

arXiv.org Artificial IntelligenceDec-11-2025

Financial Instruction Following Evaluation (FIFE)

Matlin, Glenn, Siddharth, null, JM, Anirudh, Shukla, Aditya, Hassan, Yahya, Chava, Sudheer

Language Models (LMs) struggle with complex, interdependent instructions, particularly in high-stakes domains like finance where precision is critical. We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We evaluate 53 models (proprietary, open-weight, open-source) in a zero-shot setting. Our key findings reveal a clear performance hierarchy: the top open-weight model (76.1 strict / 79.5 loose) surpasses the leading proprietary system (65.9 strict / 70.5 loose), while the best open-source models lag significantly (45.5 strict / 48.9 loose). However, even top-performing models struggle with FIFE's complex requirements, failing to achieve perfect compliance. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.

fin, large language model, machine learning, (19 more...)

2512.08965

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report (0.87)

Industry: Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)

Pu, Sophia Xiao, Cheng, Sitao, Wang, Xin Eric, Wang, William Yang

Dynamic Evaluation for Oversensitivity in LLMs

arXiv.org Artificial IntelligenceOct-23-2025

Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade overtime as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model's unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.

large language model, machine learning, natural language, (18 more...)

2510.19005

Country: North America > Mexico (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Hathidara, Ashutosh, Yu, Julien, Schreiber, Sebastian

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

arXiv.org Artificial IntelligenceOct-10-2025

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

large language model, machine learning, natural language, (19 more...)

2507.03336

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Neural Information Processing SystemsAug-18-2025, 22:02:46 GMT

f5bf0ba0a17ef18f9607774722f5698c-Supplemental.pdf

artificial intelligence, dynamic evaluation, machine learning, (18 more...)

Country:

North America > Canada (0.14)
North America > United States > Alabama (0.05)
North America > United States > Texas (0.04)
(5 more...)

Industry: Government (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsAug-18-2025, 22:02:42 GMT

f5bf0ba0a17ef18f9607774722f5698c-Paper.pdf

large language model, machine learning, natural language, (20 more...)

Country:

Oceania > New Zealand (0.04)
North America > United States > California > Monterey County > Pacific Grove (0.04)
North America > Canada > Ontario > Toronto (0.04)
(3 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Government > Regional Government (0.67)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Communications (0.93)

Neural Information Processing SystemsMay-27-2025, 21:20:02 GMT

DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph

adaptive reasoning graph, dynamic evaluation, language model, (5 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial IntelligenceJun-7-2024

Dynamic Evaluation of Large Language Models by Meta Probing Agents

Zhu, Kaijie, Wang, Jindong, Zhao, Qinlin, Xu, Ruochen, Xie, Xing

Evaluation of large language models (LLMs) has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks can only provide the overall benchmark results and cannot support a fine-grained and multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal~\citep{zhu2023dyval}. MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. Our multifaceted analysis demonstrated the strong correlation between the basic abilities and an implicit Matthew effect on model size, i.e., larger models possess stronger correlations of the abilities. MPA can also be used as a data augmentation approach to enhance LLMs. Code is available at: https://github.com/microsoft/promptbench.

arxiv preprint arxiv, benchmark, llm, (13 more...)

2402.14865

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)

Herel, David, Mikolov, Tomas

Collapse of Self-trained Language Models

arXiv.org Artificial IntelligenceApr-2-2024

In various fields of knowledge creation, including science, new ideas often build on pre-existing information. In this work, we explore this concept within the context of language models. Specifically, we explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions. While this approach is intuitively appealing, our research reveals its practical limitations. We find that extended self-training of the GPT-2 model leads to a significant degradation in performance, resulting in repetitive and collapsed token output.

iclr 2024, iteration, language model, (13 more...)

2404.02305

Country:

Europe > Czechia > Prague (0.05)
North America > United States > California > Monterey County > Pacific Grove (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)

Genre: Research Report > New Finding (0.70)

Industry: Law (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)